\[ \left ( \Omega, \mathcal{B}, P \right ) \]
\[ \left ( \color{#E69F00}{\Omega}, \mathcal{B}, P \right ) \]
Sample Space: Set of outcomes
\[ \left ( \Omega, \color{#56B4E9}{\mathcal{B}}, P \right ) \]
Event Space: Set of events
\[ \left ( \Omega, \color{#56B4E9}{\mathcal{B}}, P \right ) \]
\(\mathcal{B}\) is modeled as a \(\sigma\)-algebra of \(\Omega\): a set of subsets of \(\Omega\) with the following properties:
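Spelled out, those properties (the standard \(\sigma\)-algebra axioms) are:

\[ \Omega \in \mathcal{B} \]
\[ A \in \mathcal{B} \implies A^c \in \mathcal{B} \]
\[ A_1, A_2, \ldots \in \mathcal{B} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{B} \]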
\[ \left ( \Omega, \color{#56B4E9}{\mathcal{B}}, P \right ) \]
There can be multiple \(\sigma\)-algebras for a single sample space. For example:
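For a single coin flip with \(\Omega = \{H, T\}\), both the trivial \(\sigma\)-algebra and the full power set qualify:

\[ \mathcal{B}_1 = \{\emptyset, \Omega\} \qquad \mathcal{B}_2 = \{\emptyset, \{H\}, \{T\}, \Omega\} \]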
\[ \left ( \Omega, \mathcal{B}, \color{#009E73}{P} \right ) \]
\(P\) is a Probability Function that assigns a number (a probability; \(0\leq p \leq 1\)) to each event in \(\mathcal{B}\).
\[ \left ( \Omega, \mathcal{B}, \color{#009E73}{P} \right ) \]
The probability function must satisfy:
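The standard conditions are the Kolmogorov axioms:

\[ P(A) \geq 0 \text{ for every } A \in \mathcal{B} \]
\[ P(\Omega) = 1 \]
\[ P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \text{ for pairwise disjoint } A_i \]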
Why was all that important? It allows us to know things like:
which allow us to do all of statistics.
For \(A,B \in \mathcal{B}\) and \(P(B) \neq 0\):
\[ P(A \mid B) = \frac{P(A\cap B)}{P(B)} \]
Conditioning on \(B\) means that \(B\) becomes the new sample space \(\Omega\) (it reduces the set of outcomes we're considering)
\[P(A \mid B) = P(A)\]
Learning that \(B\) happened does not give us any information about whether \(A\) happened, and therefore \(A\) and \(B\) are independent.
\[ P(A \mid B) = \frac{P(A\cap B)}{P(B)} = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]
Bayes’ Rule is a way to calculate conditional probabilities
\[ P(\text{covid} \mid +) = \frac{P(+ \mid \text{covid}) \cdot P(\text{covid})}{P(+)} \]
What is the probability of having covid given that you got a positive covid test?
\[ P(\text{covid}) = 0.01 \\ \text{Sensitivity} = 0.65 \\ \text{Specificity} = 0.90 \]
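A quick way to compute this in R, expanding \(P(+)\) with the law of total probability (a sketch using only the three numbers above):

```r
prior       <- 0.01  # P(covid)
sensitivity <- 0.65  # P(+ | covid)
specificity <- 0.90  # P(- | no covid)

# P(+) = P(+ | covid) P(covid) + P(+ | no covid) P(no covid)
p_pos     <- sensitivity * prior + (1 - specificity) * (1 - prior)
posterior <- sensitivity * prior / p_pos
round(posterior, 3)  # ~0.062: even after a positive test, covid is unlikely
```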
\[ P(\text{theory} \mid \text{data}) = \frac{P(\text{data} \mid \text{theory}) \cdot P(\text{theory})}{P(\text{data})} \]
“How did my theory change after seeing the data?”
\[ \color{#D55E00}{P(\text{theory} \mid \text{data})} = \frac{P(\text{data} \mid \text{theory}) \cdot P(\text{theory})}{P(\text{data})} \]
The Posterior is the probability of our theory after seeing the data.
\[ P(\text{theory} \mid \text{data}) = \frac{P(\text{data} \mid \text{theory}) \cdot \color{#CC79A7}{P(\text{theory})}}{P(\text{data})} \]
The Prior is the probability of our theory before seeing the data.
\[ P(\text{theory} \mid \text{data}) = \frac{\color{#F5C710}{P(\text{data} \mid \text{theory})} \cdot P(\text{theory})}{P(\text{data})} \]
The Likelihood is the probability of our data, assuming the theory is true (how well the theory fits the data).
\[ P(\text{theory} \mid \text{data}) = \frac{P(\text{data} \mid \text{theory}) \cdot P(\text{theory})}{\color{#009E73}{P(\text{data})}} \]
The Normalizing Constant is the probability of our data. It normalizes the posterior so that it's a valid probability (distribution).
It does not matter for comparing theories: it is the same for every theory given the same data.
Probability of the data given the parameters: \(P(x \mid \theta)\)
Likelihood of the parameters given the data: \(L(\theta \mid x)\)
Poll 100 people about whether they’ll vote for A (\(1\)) or not (\(0\)).
There are \(2^{100}\) ways for these 100 people to respond:
\[\begin{bmatrix} 0 & 1 & 0 & \ldots & 1 \\ 0 & 0 & 1 & \ldots & 1\\ 0 & 0 & 0 & \ldots & 1\\ 0 & 0 & 0 & \ldots & 1\\ \ldots & \ldots & \ldots & \ldots & \ldots\\ 0 & 0 & 0 & \ldots & 1 \end{bmatrix}\]
There are \(2^{100}\) ways for these 100 people to respond: do you care about all \(2^{100}\) ways?
| No. Votes | No. Outcomes |
|---|---|
| 0 | 1 |
| 1 | 100 |
| 2 | 4950 |
| … | … |
| 100 | 1 |
The summary statistic: the number of votes for A.
Assume \(P(1) = P(0) = 0.5\)
| No. Votes | No. Outcomes | p |
|---|---|---|
| 0 | 1 | \(\frac{1}{2^{100}}\) |
| 1 | 100 | \(\frac{100}{2^{100}}\) |
| 2 | 4950 | \(\frac{4950}{2^{100}}\) |
| … | … | … |
| 100 | 1 | \(\frac{1}{2^{100}}\) |
What is \(P(\text{votes} \leq 40)\) ?
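In R this is one call to the binomial CDF (or, equivalently, a sum over the pmf), keeping the fair-coin assumption \(p = 0.5\) from above:

```r
p40     <- pbinom(40, size = 100, prob = 0.5)        # CDF: P(votes <= 40)
p40_sum <- sum(dbinom(0:40, size = 100, prob = 0.5)) # same thing via the pmf
p40  # about 0.03
```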
cumulative distribution function: \(F_{X}(x) = P(X \leq x) = \sum_{x_i \leq x} P(x_i)\)
probability mass function: \(f_{X}(x) = P(X = x)\)
cumulative distribution function: \(F_{X}(x) = P(X \leq x) = \int_{-\infty}^x f_X(t) \, dt\)
probability density function: \(f_{X}(x) = \frac{d}{dx} F_{X}(x)\)
expected value \(\mathbb{E}(X)\): typical or average value for the random variable
\[\mathbb{E}(X) = \sum_{i=1}^n\color{#E69F00}{\underbrace{x_i}_\text{value}}\color{#56B4E9}{\overbrace{f_X(x_i)}^\text{probability of that value}}\]
The expected value is the sum of the possible values weighted by the probability of that value.
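As a check, the same weighted sum in R for the fair-coin poll above (values 0 to 100, binomial probabilities):

```r
x <- 0:100                              # possible numbers of votes
p <- dbinom(x, size = 100, prob = 0.5)  # probability of each value
sum(x * p)                              # E(X) = 50
```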
Moments of a distribution are expectations.
\[ \mu'_n = \mathbb{E}X^n \]
Central Moments replace \(X\) with the mean centered value \(X-\mu\).
\[ \mu_n = \mathbb{E}(X-\mu)^n \]
mean: 1st moment
variance: 2nd central moment
skew: 3rd central moment
kurtosis: 4th central moment
\(\mathbb{E}(aX + b) = a\,\mathbb{E}(X) + b\)
\(\mathbb{E}(X + Y) = \mathbb{E}(X) + \mathbb{E}(Y)\)
\(Var(aX + b) = a^2Var(X)\)
\(Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)\)
\(Var(X - Y) = Var(X) + Var(Y) - 2Cov(X,Y)\)
Test yourself!
\(\mathbb{E}(X + 2Y); X \sim N(0,10); Y \sim N(2,2)\)
\(Var(X + 2Y); X \sim N(0,10); Y \sim N(2,2)\)
Don’t calculate, simulate!
# X ~ N(0,1)
X <- rnorm(1e5,0,1) # n,mu,sd
Z <- X*4.5 + 10
W <- X^2
print(paste0("Z: ", round(mean(Z),2), ", W: ", round(mean(W),2)))
[1] "Z: 9.99, W: 1.01"
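The same trick answers the "test yourself" questions, assuming \(N(\mu, \sigma)\) means (mean, sd) as in `rnorm`, and that \(X\) and \(Y\) are independent:

```r
set.seed(1)
X <- rnorm(1e6, 0, 10)
Y <- rnorm(1e6, 2, 2)
mean(X + 2*Y)  # E(X) + 2 E(Y) = 0 + 4 = 4
var(X + 2*Y)   # Var(X) + 4 Var(Y) = 100 + 16 = 116
```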
\(\mu\): mean/median/center
\(\sigma\): standard deviation
Note: when \(\nu > 30\), the Student-t distribution is quite close to \(N(0,1)\)
\(location\): center of distribution
\(scale\): width of distribution
Note: the Cauchy distribution has an undefined mean and variance. It is the same as a Student-t distribution with \(\nu = 1\)
\(\mu\): mean of distribution
\(s\): scale of distribution
Note: you might be more familiar with the CDF of the logistic distribution: \(\frac{1}{1 + e^{-x}}\), from logistic regression
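R exposes this CDF directly as `plogis` (with default location 0 and scale 1), so the two forms can be checked against each other:

```r
x <- seq(-3, 3, by = 0.5)
all.equal(plogis(x), 1 / (1 + exp(-x)))  # TRUE
```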
\(\alpha\): number of successes
\(\beta\): number of failures
Note: you can also parameterize the beta distribution with \(\mu = \frac{\alpha}{\alpha + \beta}\) and \(\sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}\) which is easier in some cases
\(n\): number of trials
\(p\): probability of success
Note: a Bernoulli distribution is just a binomial where \(n = 1\)
Note: Dirichlet is a generalization of the Beta distribution when there are more than 2 possible outcomes. Thank you to Andrew Heiss for a version of this plot code.
Note: mean = variance = \(\lambda\). Counts sometimes have more variance than this, called overdispersion; in that case, a negative binomial distribution is more appropriate.
\(k\): shape
\(\theta\): scale
KNN Graph
❓ How would you identify an “influencer” in a graph?
Dihyat, M. M., Malik, K., Khan, M. A., & Imran, B. (2021). Detecting ideal Instagram influencer using social network analysis. arXiv preprint arXiv:2107.05731.
Requirements:
start/end at same node
do not pass through the same node more than once
Directed: edges have a direction from one node to another
Acyclic: there’s no cycles
Graph: …
❓ Why would cycles be a problem in a causal DAG?
Clique: subsets of nodes where each pair of nodes in the clique are connected by an edge.
❓ On a show like Love Island, where only straight couples are allowed, what is the maximum clique size of a graph with nodes = contestants and edges = couplings? Does it change with the number of contestants?
Density: how many edges exist, compared to the number of possible edges. How “connected” the graph is.
\[ \rho = \frac{2m}{\underbrace{n(n-1)}_\text{possible edges}} \]
where \(m\) is the number of edges, and \(n\) is the number of nodes.
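The formula as a tiny helper function (a sketch for undirected graphs without self-loops):

```r
graph_density <- function(m, n) 2 * m / (n * (n - 1))
graph_density(m = 3, n = 3)  # a triangle is fully connected: density 1
```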
In DBSCAN, we make a graph where data points are nodes and edges exist between neighbors. We’re looking for sub-graphs that are densely connected.
minPts influences how dense the sub-graphs need to be, and eps determines which edges we’ll add to the graph as a whole.
\[ H(p) = - \sum_{i=1}^N p(x_i) \times log(p(x_i)) \]
A measure of disorder, uncertainty, surprise or chaos.
Information in an event: \(-log(p(x))\) , so \(H(p)\) gives you the expected information.
❓ When do we get the most information?
Entropy for fair coin toss:
\[ -(0.5 \cdot \log_{10} 0.5 + 0.5 \cdot \log_{10} 0.5) \approx 0.30 \]
Entropy for unfair coin toss:
\[-(0.99 \cdot \log_{10} 0.99 + 0.01 \cdot \log_{10} 0.01) \approx 0.02 \]
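The two coin examples as R code. The values above match base-10 logarithms; with base 2, the fair coin gives exactly 1 bit:

```r
entropy <- function(p, base = 10) -sum(p * log(p, base = base))
entropy(c(0.5, 0.5))            # ~0.30 (fair coin)
entropy(c(0.99, 0.01))          # ~0.02 (unfair coin)
entropy(c(0.5, 0.5), base = 2)  # 1 bit
```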
\[ H(p,q) = - \sum_{i=1}^N p(x_i) \times log(q(x_i)) \]
If we think \(q(x)\) is the probability distribution, but \(p(x)\) is the real probability distribution, how surprised will we be?
\[ H(A,B) = - \left (\frac{1}{8} * log_2 \left (\frac{7}{8} \right) + \frac{7}{8} * log_2 \left (\frac{1}{8} \right) \right) = 2.6 \]
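Reproducing that computation in R, taking \(p = (1/8, 7/8)\) as the true distribution and \(q = (7/8, 1/8)\) as the mistaken one:

```r
cross_entropy <- function(p, q) -sum(p * log2(q))
cross_entropy(p = c(1/8, 7/8), q = c(7/8, 1/8))  # ~2.65
```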
Example: cross-entropy loss
\[D_{KL} \left (p||q \right ) = \underbrace{H(p,q)}_\text{cross entropy} - \underbrace{H(p)}_\text{entropy}\]
Measure of distance between distributions \(p\) and \(q\).
How much information do you lose if you approximate \(p\) with \(q\)?
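A sketch verifying the identity numerically with base-2 logs, reusing the two coin distributions from the cross-entropy example:

```r
kl_div        <- function(p, q) sum(p * log2(p / q))
cross_entropy <- function(p, q) -sum(p * log2(q))
entropy2      <- function(p) -sum(p * log2(p))

p <- c(1/8, 7/8)
q <- c(7/8, 1/8)
all.equal(kl_div(p, q), cross_entropy(p, q) - entropy2(p))  # TRUE
```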